

Theoretical study


Query Complexity of Clustering with Side Information

Neural Information Processing Systems

Suppose we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pairwise queries of the form ``do two elements $u$ and $v$ belong to the same cluster?''. The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of the query complexity of interactive clustering, and give strong information-theoretic lower bounds as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. To improve the accuracy of clustering, a fruitful approach in recent years has been to ask a domain expert or a crowd to obtain labeled data interactively.
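The query model in this abstract can be illustrated with a minimal sketch (not the paper's algorithm, which uses side information to do far better): greedily place each element by querying one representative per existing cluster. The `oracle` callable and the toy ground truth below are illustrative stand-ins.

```python
# Hypothetical sketch of the pairwise same-cluster query model.
def cluster_with_oracle(elements, oracle):
    """Greedily assign each element by querying one representative
    per existing cluster; worst case O(nk) queries for k clusters."""
    clusters = []  # list of lists of elements
    for x in elements:
        placed = False
        for c in clusters:
            if oracle(x, c[0]):  # ask: same cluster as this representative?
                c.append(x)
                placed = True
                break
        if not placed:
            clusters.append([x])
    return clusters

# toy ground truth: parity of the integer defines the cluster
truth = lambda u, v: (u % 2) == (v % 2)
print(cluster_with_oracle([1, 2, 3, 4, 5], truth))  # → [[1, 3, 5], [2, 4]]
```

This naive scheme spends $O(nk)$ queries; the paper's point is that side information (a similarity matrix) can drive the query count much lower.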


Dimensionality reduction: theoretical perspective on practical measures

Neural Information Processing Systems

Dimensionality reduction plays a central role in real-world applications of machine learning, among many other fields. In particular, metric dimensionality reduction, where data from a general metric space is mapped into a low-dimensional space, is often used as a first step before applying machine learning algorithms. In almost all these applications the quality of the embedding is measured by various average-case criteria. Metric dimensionality reduction has also been studied in mathematics and theoretical computer science, within the extremely fruitful and influential field of metric embedding. Yet the vast majority of theoretical research has been devoted to analyzing the worst-case behavior of embeddings and therefore has little relevance to practical settings.


A new methodology to decompose a parametric domain using reduced order data manifold in machine learning

Mang, Chetra, TahmasebiMoradi, Axel, Yagoubi, Mouadh

arXiv.org Machine Learning

We propose a new methodology for parametric domain decomposition using iterative principal component analysis. Starting with iterative principal component analysis, the high-dimensional manifold is reduced to a lower-dimensional manifold. Moreover, two approaches are developed to reconstruct the inverse projector that maps from the lower-dimensional components back to the original ones. Afterward, we provide a detailed strategy to decompose the parametric domain based on the low-dimensional manifold. Finally, numerical examples of a harmonic transport problem are given to illustrate the efficiency and effectiveness of the proposed method compared to classical meta-models such as neural networks.
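The reduce-then-invert step described above can be sketched with one plain PCA pass and its linear inverse projector (this is an illustration under assumed data, not the paper's iterative procedure; the matrix names are ours):

```python
# Minimal PCA sketch: project to a lower-dimensional manifold and map back.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # 200 samples in 5-D
X[:, 3] = X[:, 0] + X[:, 1]            # introduce redundancy: rank drops to 3
X[:, 4] = X[:, 0] - X[:, 2]

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 3                                   # target (lower) dimension
Z = (X - mean) @ Vt[:k].T               # project onto first k components
X_rec = Z @ Vt[:k] + mean               # linear inverse projector

print(X_rec.shape)                      # (200, 5)
print(np.allclose(X, X_rec))            # exact here: the data has rank 3
```

The paper's contribution replaces the single linear inverse map above with two reconstruction approaches developed for the iterative setting.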


An adaptive sampling algorithm for data-generation to build a data-manifold for physical problem surrogate modeling

Mang, Chetra, TahmasebiMoradi, Axel, Danan, David, Yagoubi, Mouadh

arXiv.org Machine Learning

Physical models classically involve partial differential equations (PDEs) and, depending on their underlying complexity and the level of accuracy required, are known to be computationally expensive to solve numerically. Thus, an idea is to create a surrogate model relying on data generated by such a solver. However, training such a model on imbalanced data has been shown to be a very difficult task. Indeed, if the distribution of inputs leads to a poor representation of the response manifold, the model may not learn well and, consequently, may not predict the outcome with acceptable accuracy. In this work, we present an Adaptive Sampling Algorithm for Data Generation (ASADG) involving a physical model. As the initial input data may not accurately represent the response manifold in higher dimensions, the algorithm iteratively adds input data to it. At each step, the barycenter of each simplex into which the manifold is discretized is added as new input data if a certain threshold is satisfied. We demonstrate the efficiency of the data sampling algorithm in comparison with the Latin Hypercube Sampling (LHS) method for generating more representative input data. To do so, we focus on the construction of a metamodel for a harmonic transport problem by generating data through a classical solver. By using this algorithm, it is possible to generate the same number of input data points as LHS while providing a better representation of the response manifold.
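The barycenter-refinement step in the abstract can be sketched as follows. This is a hedged illustration: the function name, the spread-based threshold rule, and the toy 1-D data are ours, not ASADG's exact criterion.

```python
# Sketch of barycenter refinement over a simplicial discretization:
# add the barycenter of any simplex whose response varies too much
# across its vertices (a stand-in for the paper's threshold test).
import numpy as np

def refine(points, responses, simplices, threshold):
    """points: (n, d) inputs; responses: (n,) solver outputs;
    simplices: (m, d+1) rows of vertex indices; returns new inputs."""
    new_points = []
    for simplex in simplices:
        spread = responses[simplex].max() - responses[simplex].min()
        if spread > threshold:                # poorly represented region
            new_points.append(points[simplex].mean(axis=0))  # barycenter
    return np.array(new_points)

# toy 1-D example: two segments (1-simplices) over f(x) = x**2
pts = np.array([[0.0], [1.0], [2.0]])
resp = pts[:, 0] ** 2                         # [0., 1., 4.]
simps = np.array([[0, 1], [1, 2]])
print(refine(pts, resp, simps, threshold=2.0))  # only [1, 2] refined → [[1.5]]
```

Each new barycenter would then be fed back to the solver, growing the training set exactly where the response manifold is under-resolved.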


Generalized Decision Focused Learning under Imprecise Uncertainty--Theoretical Study

Shariatmadar, Keivan, Yorke-Smith, Neil, Osman, Ahmad, Cuzzolin, Fabio, Hallez, Hans, Moens, David

arXiv.org Artificial Intelligence

Decision-focused learning has emerged as a critical paradigm for integrating machine learning with downstream optimisation. Despite its promise, existing methodologies predominantly rely on probabilistic models and focus narrowly on task objectives, overlooking the nuanced challenges posed by epistemic uncertainty, non-probabilistic modelling approaches, and the integration of uncertainty into optimisation constraints. This paper bridges these gaps by introducing innovative frameworks: (i) a non-probabilistic lens for epistemic uncertainty representation, leveraging intervals (the least informative uncertainty model), contamination (a hybrid model), and probability boxes (the most informative uncertainty model); (ii) methodologies to incorporate uncertainty into constraints, expanding decision-focused learning's utility in constrained environments; (iii) the adoption of imprecise decision theory for ambiguity-rich decision-making contexts; and (iv) strategies for addressing sparse-data challenges. Empirical evaluations on benchmark optimisation problems demonstrate the efficacy of these approaches in improving decision quality and robustness and in dealing with the said gaps.
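The interval representation mentioned in (i) and its use inside constraints in (ii) can be illustrated with a minimal sketch (our own toy example, not the paper's framework): an epistemic quantity known only to an interval, propagated through a sum and checked robustly against a constraint bound.

```python
# Illustrative interval-uncertainty sketch: a constraint is robustly
# feasible only if it holds for every value the interval allows.
def interval_add(a, b):
    """Sum of two intervals given as (lo, hi) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def robustly_feasible(interval, bound):
    """Constraint x <= bound holds for all realisations in the interval."""
    return interval[1] <= bound

demand = (3.0, 5.0)          # uncertain demand, known only to an interval
buffer = (1.0, 1.5)
total = interval_add(demand, buffer)             # (4.0, 6.5)
print(total, robustly_feasible(total, 7.0))      # (4.0, 6.5) True
```

Probability boxes refine this picture by bounding the whole distribution function rather than just the support, which is why the abstract orders the three models by informativeness.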


Review for NeurIPS paper: Rational neural networks

Neural Information Processing Systems

Additional Feedback: This work proposes a new activation function to serve deep learning architectures, providing a theoretical study of its complexity. The paper is well written and highly readable for most readers of the data mining community. However, the article would be significantly enhanced if the issues related to its motivation, technical analysis, and experiments were addressed. Detailed comments are given in the following: 1) Motivation – This paper proposes a rational activation function as an alternative to ReLU, potentially avoiding the vanishing gradient problem. * The problem raised in this paper, i.e., that some existing activation functions (e.g., sigmoid, logistic) can only handle smooth signals, is a significant problem in deep neural network optimization, since their derivatives are near zero for large values. Low-degree rational functions can save time, but is there a better configuration, and why choose this type?
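The activation family the review discusses can be sketched as a ratio of low-degree polynomials applied elementwise; the coefficients below are illustrative placeholders, not the paper's trained or initial values.

```python
# Sketch of a rational activation: P(x)/Q(x) with P cubic and Q quadratic
# (a low-degree configuration of the kind the review asks about).
import numpy as np

def rational(x, p=(0.0, 1.0, 0.0, 0.1), q=(1.0, 0.0, 0.1)):
    """Elementwise P(x)/Q(x). Q = 1 + 0.1*x**2 stays positive,
    so the function has no poles on the real line."""
    num = sum(c * x**i for i, c in enumerate(p))
    den = sum(c * x**i for i, c in enumerate(q))
    return num / den

x = np.linspace(-3, 3, 7)
y = rational(x)
print(y.shape)        # (7,)
```

In a network, the coefficients of P and Q would be trainable per layer, which is what distinguishes this family from a fixed nonlinearity like ReLU.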


Reviews: Regularized Gradient Boosting

Neural Information Processing Systems

This paper proposes Rademacher generalization bounds for regularized gradient boosting, which encompass various accelerated GB methods. Although some work remains to be done to make the algorithm derived from the theoretical study faster, the theoretical study itself deserves publication.